{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=center>\n",
    "<img src=https://github.com/datoujinggzj/WhaleDataScienceProject/blob/master/pic/project_logo.jpg?raw=true width='800' />\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "今天分享关于【缺失值处理】的基本方法汇总\n",
    "\n",
    "```diff\n",
    "- 大家好，我是知识渊博、爱好小酌、嗓音独特，还挺幽默的鲸鲸！\n",
    "```\n",
    "\n",
    "今天分享【Data Preparation】的重要环节：缺失值处理。\n",
    "\n",
    "那么在诊断数据缺失值时，一定要搞清楚为什么会存在缺失值，这将极大影响我们接下来如何处理缺失值！\n",
    "\n",
    "先介绍一下「缺失类型」\n",
    "\n",
    "1. MCAR (Missing completely at Random)\n",
    "2. MAR (Missing at Random)\n",
    "3. MNAR (Missing not at Random)\n",
    "\n",
    "MCAR 的意思是“缺失与任何值之间没有关系”。因此，在这种情况下，可以删除缺失值。为此，可以删除列缺失值；最小化丢失数据或删除行缺失值。\n",
    "\n",
    "MAR 的意思是“缺失与其他观察到的数据之间存在系统关系，但与缺失数据无关”。\n",
    "\n",
    "最后，MNAR 的意思是“缺失与它的值之间存在关系，缺失或非缺失”。\n",
    "\n",
    "---\n",
    "\n",
    "\n",
    "那么如果你是小白的话，我强烈建议花30min阅读本文，本文介绍了常用的4个方法去处理缺失值。\n",
    "\n",
    "\n",
    "（点击下面跳转）\n",
    "\n",
    "1.  [None/NaN 的区别](#None和NaN的区别)\n",
    "2.  [缺失值分类](#缺失值分类)\n",
    "3.  [缺失值查看](#缺失值查看)\n",
    "3.  [缺失值移除](#缺失值移除)\n",
    "4.  [缺失值填充](#缺失值移除)\n",
    "---\n",
    "补充：[对缺失值鲁棒性高的模型汇总](#对缺失值鲁棒性高的模型汇总)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# None和NaN的区别\n",
    "\n",
    "先说说 None/NaN 的区别"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Pandas automatically converts the None to a NaN value. \n",
    "# (Be aware that there is a proposal to add a native integer NA to Pandas in the future; as of this writing, it has not been included)\n",
    "df = pd.DataFrame({'A':[1,2,None], \n",
    "                  'B':[5,np.nan,np.nan],\n",
    "                  'C':[1,2,3]})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, None, 3, 4], dtype=object)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "None_arr = np.array([1, None, 3, 4])\n",
    "None_arr"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dtype('float64')"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "NaN_arr = np.array([1, np.NaN, 3, 4])\n",
    "NaN_arr.dtype"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "https://stackoverflow.com/questions/17534106/what-is-the-difference-between-nan-and-none"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "|Typeclass     | Conversion When Storing NAs | NA Sentinel Value      |\n",
    "|--------------|-----------------------------|------------------------|\n",
    "| ``floating`` | No change                   | ``np.nan``             |\n",
    "| ``object``   | No change                   | ``None`` or ``np.nan`` |\n",
    "| ``integer``  | Cast to ``float64``         | ``np.nan``             |\n",
    "| ``boolean``  | Cast to ``object``          | ``None`` or ``np.nan`` |\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 缺失值分类"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Rubin在1976年把缺失值归为3类：\n",
    "1. Missing Completely at Random（MCAR）\n",
    "2. Missing at Random（MAR）\n",
    "3. Missing NOT at Random（MNAR）"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. Missing Completely at Random（MCAR）\n",
    "\n",
    "MCAR是指缺失值产生的原因完全随机，我们无法通过其他已知数据预测该缺失值。\n",
    "\n",
    "2. Missing at Random（MAR）\n",
    "\n",
    "MAR指的是缺失值可以用其他的列解释并预测。比如：你求婚失败了，你不知道为啥，那么你可能要关注一下其他的因子（变量），“她是否爱上别人了？”或者“老丈母娘不喜欢你”等等。\n",
    "\n",
    "3. Missing NOT at Random（MNAR）\n",
    "\n",
    "MNAR表示缺失值的产生有所原因，比如：一个人很胖，所以他不愿意提供自己的体重。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 缺失值查看\n",
    "\n",
    "介绍检查缺失值的简易方法：\n",
    "\n",
    "- isnull()\n",
    "- notnull()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "      <th>C</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       A      B      C\n",
       "0  False  False  False\n",
       "1  False   True  False\n",
       "2   True   True  False"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.isnull()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "      <th>C</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       A      B     C\n",
       "0   True   True  True\n",
       "1   True  False  True\n",
       "2  False  False  True"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.notnull()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "A    0.33\n",
       "B    0.67\n",
       "C    0.00\n",
       "dtype: float64"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "round(df.isnull().sum()/df.shape[0],2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "A    0.333333\n",
       "B    0.666667\n",
       "C    0.000000\n",
       "dtype: float64"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.isnull().sum()/df.shape[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    0.000000\n",
       "1    0.333333\n",
       "2    0.666667\n",
       "dtype: float64"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.isnull().sum(axis=1)/df.shape[1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 缺失值移除\n",
    "\n",
    "一般缺失值占比在50%左右即可drop"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "df1 = df.copy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "      <th>C</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     A    B  C\n",
       "0  1.0  5.0  1"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df1.dropna() # 默认drop axis是行"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>C</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   C\n",
       "0  1\n",
       "1  2\n",
       "2  3"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df1.dropna(axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>C</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   C\n",
       "0  1\n",
       "1  2\n",
       "2  3"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df1.dropna(axis='columns') # 跟上面一样"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "      <th>C</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     A    B  C\n",
       "0  1.0  5.0  1\n",
       "1  2.0  NaN  2\n",
       "2  NaN  NaN  3"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df1  # inplace 没有设置成 True，所以对 df1 没有原地操作，不影响 df1 本身！"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "      <th>C</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     A    B  C\n",
       "0  1.0  5.0  1"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df1.dropna(axis=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "      <th>C</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     A    B  C\n",
       "0  1.0  5.0  1\n",
       "1  2.0  NaN  2"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.dropna(thresh=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 缺失值填充\n",
    "\n",
    "介绍几个填充缺失值的常用方法：\n",
    "\n",
    "#### univariate 单变量\n",
    "- [fillna](#fillna)\n",
    "- [replace](#replace)\n",
    "- [SimpleImputer](#SimpleImputer)\n",
    "\n",
    "#### multivariate 多变量\n",
    "\n",
    "- [Impyute](#Impyute)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "df2 = df.copy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### fillna"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "      <th>C</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "      <td>填充</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>填充</td>\n",
       "      <td>填充</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     A    B  C\n",
       "0  1.0  5.0  1\n",
       "1  2.0   填充  2\n",
       "2   填充   填充  3"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 固定值填充\n",
    "df2.fillna(value='填充') "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "      <th>C</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     A    B  C\n",
       "0  1.0  5.0  1\n",
       "1  2.0  5.0  2\n",
       "2  2.0  5.0  3"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# method='ffill'\n",
    "df2.fillna(method = 'ffill') # 根据前一个值填充，如果第一个值就是nan，那此方法不适用。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "      <th>C</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     A    B  C\n",
       "0  1.0  5.0  1\n",
       "1  2.0  NaN  2\n",
       "2  NaN  NaN  3"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# method='bfill'\n",
    "df2.fillna(method = 'bfill') # 根据前一个值填充，如果最后的就是nan，那此方法不适用。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### replace"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "      <th>C</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "      <td>-1.0</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-1.0</td>\n",
       "      <td>-1.0</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     A    B  C\n",
       "0  1.0  5.0  1\n",
       "1  2.0 -1.0  2\n",
       "2 -1.0 -1.0  3"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# replace\n",
    "df2.replace(np.nan, -1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### SimpleImputer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "# SimpleImputer\n",
    "\n",
    "from sklearn.impute import SimpleImputer\n",
    "imp_mean = SimpleImputer( strategy='mean') #for median imputation replace 'mean' with 'median'\n",
    "imp_mean.fit(df)\n",
    "imputed_df = imp_mean.transform(df)  # return array"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[1. , 5. , 1. ],\n",
       "       [2. , 5. , 2. ],\n",
       "       [1.5, 5. , 3. ]])"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "imputed_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Impyute"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "20640\n"
     ]
    }
   ],
   "source": [
    "from sklearn.datasets import fetch_california_housing\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.model_selection import StratifiedKFold\n",
    "from sklearn.metrics import mean_squared_error\n",
    "from math import sqrt\n",
    "import random\n",
    "import numpy as np\n",
    "random.seed(0)\n",
    "\n",
    "#Fetching the dataset\n",
    "import pandas as pd\n",
    "dataset = fetch_california_housing()\n",
    "train, target = pd.DataFrame(dataset.data), pd.DataFrame(dataset.target)\n",
    "train.columns = ['0','1','2','3','4','5','6','7']\n",
    "train.insert(loc=len(train.columns), column='target', value=target)\n",
    "\n",
    "#Randomly replace 40% of the first column with NaN values\n",
    "column = train['0']\n",
    "print(column.size)\n",
    "missing_pct = int(column.size * 0.4)\n",
    "i = [random.choice(range(column.shape[0])) for _ in range(missing_pct)]\n",
    "column[i] = np.NaN"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "from impyute.imputation.cs import fast_knn\n",
    "sys.setrecursionlimit(100000) #Increase the recursion limit of the OS\n",
    "\n",
    "# start the KNN training\n",
    "imputed_training_impyute=fast_knn(train.values, k=30)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[   3.30117882,   41.        ,    6.98412698, ...,   37.88      ,\n",
       "        -122.23      ,    4.526     ],\n",
       "       [   8.3014    ,   21.        ,    6.23813708, ...,   37.86      ,\n",
       "        -122.22      ,    3.585     ],\n",
       "       [   4.31326938,   52.        ,    8.28813559, ...,   37.85      ,\n",
       "        -122.24      ,    3.521     ],\n",
       "       ...,\n",
       "       [   1.7       ,   17.        ,    5.20554273, ...,   39.43      ,\n",
       "        -121.22      ,    0.923     ],\n",
       "       [   3.40189346,   18.        ,    5.32951289, ...,   39.43      ,\n",
       "        -121.32      ,    0.847     ],\n",
       "       [   2.3886    ,   16.        ,    5.25471698, ...,   39.37      ,\n",
       "        -121.24      ,    0.894     ]])"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "imputed_training_impyute"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "from impyute.imputation.cs import mice\n",
    "imputed_training_MICE=mice(train.values)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[   7.04291115,   41.        ,    6.98412698, ...,   37.88      ,\n",
       "        -122.23      ,    4.526     ],\n",
       "       [   8.3014    ,   21.        ,    6.23813708, ...,   37.86      ,\n",
       "        -122.22      ,    3.585     ],\n",
       "       [   6.93190085,   52.        ,    8.28813559, ...,   37.85      ,\n",
       "        -122.24      ,    3.521     ],\n",
       "       ...,\n",
       "       [   1.7       ,   17.        ,    5.20554273, ...,   39.43      ,\n",
       "        -121.22      ,    0.923     ],\n",
       "       [   2.39225991,   18.        ,    5.32951289, ...,   39.43      ,\n",
       "        -121.32      ,    0.847     ],\n",
       "       [   2.3886    ,   16.        ,    5.25471698, ...,   39.37      ,\n",
       "        -121.24      ,    0.894     ]])"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "imputed_training_MICE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.impute import KNNImputer\n",
    "imputer = KNNImputer(n_neighbors=30)\n",
    "imputer.fit(train)\n",
    "# transform the dataset\n",
    "imputed_training_KNNImputer = imputer.transform(train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[   3.11527667,   41.        ,    6.98412698, ...,   37.88      ,\n",
       "        -122.23      ,    4.526     ],\n",
       "       [   8.3014    ,   21.        ,    6.23813708, ...,   37.86      ,\n",
       "        -122.22      ,    3.585     ],\n",
       "       [   4.33347333,   52.        ,    8.28813559, ...,   37.85      ,\n",
       "        -122.24      ,    3.521     ],\n",
       "       ...,\n",
       "       [   1.7       ,   17.        ,    5.20554273, ...,   39.43      ,\n",
       "        -121.22      ,    0.923     ],\n",
       "       [   3.64121333,   18.        ,    5.32951289, ...,   39.43      ,\n",
       "        -121.32      ,    0.847     ],\n",
       "       [   2.3886    ,   16.        ,    5.25471698, ...,   39.37      ,\n",
       "        -121.24      ,    0.894     ]])"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "imputed_training_KNNImputer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 对缺失值鲁棒性高的模型汇总"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=center>\n",
    "<img src=https://github.com/datoujinggzj/WhaleDataScienceProject/blob/master/pic/missing_value_algo.png?raw=true width='800' />\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Great Job!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
